[VL] Add lazy per-column deserialization for Columnar Table Cache by jackylee-ch · Pull Request #12211 · apache/gluten

jackylee-ch · 2026-06-01T04:57:56Z

What changes

This PR makes Velox table cache write V3 per-column framed bytes by default. Lazy materialization is a base table-cache capability; spark.gluten.sql.columnar.tableCache.partitionStats.enabled now only controls the optional stats/pruning payload.

Removes spark.gluten.sql.columnar.tableCache.lazy.deserialization.enabled.
Adds V3 no-stats serialization (statsLen=0) for the default lazy path.
Keeps V3 with stats for partition pruning when partition stats are enabled.
Keeps V2 stats and legacy raw bytes as native-capability / backward-read fallback paths.
Routes V3 cached bytes through projected native deserialization.
Adds JVM/native golden, lazy serde, and GHA benchmark coverage.
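
To illustrate why per-column framing enables projected decode, here is a minimal Python sketch. It is not the actual V3 wire format: the magic value `TCV3`, the header field order, and the helper names `encode_frame` / `decode_projected` are assumptions made up for this example. It only shows the idea that a frame carrying per-column lengths lets a reader skip unrequested columns by bumping an offset.

```python
import struct

MAGIC = b"TCV3"  # hypothetical 4-byte magic, not Gluten's real value


def encode_frame(columns, stats=b""):
    """Pack column payloads into one frame: header, stats blob, lengths, payloads."""
    header = MAGIC + struct.pack("<II", len(stats), len(columns))
    lengths = b"".join(struct.pack("<I", len(c)) for c in columns)
    return header + stats + lengths + b"".join(columns)


def decode_projected(frame, wanted):
    """Return {column index: payload bytes} for the requested columns only."""
    if frame[:4] != MAGIC:
        raise ValueError("not a V3-style frame")
    stats_len, num_cols = struct.unpack_from("<II", frame, 4)
    lengths_at = 12 + stats_len  # skip the optional stats blob
    lengths = struct.unpack_from("<%dI" % num_cols, frame, lengths_at)
    offset = lengths_at + 4 * num_cols
    out = {}
    for i, n in enumerate(lengths):
        if i in wanted:
            out[i] = frame[offset:offset + n]  # slice only the projected column
        offset += n  # a skipped column costs an offset bump, nothing else
    return out


cols = [b"c0-data", b"c1", b"c2-longer-payload"]
frame = encode_frame(cols)  # empty stats blob, i.e. statsLen=0
print(decode_projected(frame, {0, 2}))
```

With `stats=b""` the frame corresponds to the statsLen=0 case; passing a non-empty blob models the stats-enabled case without changing how columns are located.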

Performance

Four-mode comparison: eager V2 vs lazy V3, each without and with the optional
partition-stats payload (ColumnarTableCacheLazyDeserBenchmark):

V2 without stats = legacy raw Presto payload (eager full-batch decode, no pruning).
V2 with stats = framedSerializeWithStats (eager full-batch decode + partition-stats pruning).
V3 without stats = per-column lazy payload (default; lazy projected decode).
V3 with stats = per-column lazy payload + partition-stats pruning.

100M rows / 32 partitions / 16 columns / 3 iterations, Apple M5 Pro, JDK 8 runtime, real Gluten
(off-heap enabled, ColumnarCachedBatchSerializer). Read phases build one mode's cache at a time so
the full 100M fits. Times are avg ms, lower is better; relative is vs V2 without stats.

Cache footprint (storage memory)

Mode	Footprint
V2 without stats	14542 MiB
V2 with stats	14558 MiB
V3 without stats	14543 MiB
V3 with stats	14565 MiB

Footprint is effectively identical across all four modes (the largest gap is 23 MiB, under 0.2%):
V3 per-column framing does not regress cache size for flat data, and the stats payload is negligible.
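
The relative differences can be worked out directly from the footprint table. A small Python sketch using only the numbers reported here (the variable names are illustrative):

```python
# Cache footprint in MiB, copied from the table above.
footprint_mib = {
    "V2 without stats": 14542,
    "V2 with stats": 14558,
    "V3 without stats": 14543,
    "V3 with stats": 14565,
}

baseline = footprint_mib["V2 without stats"]

# Percentage overhead of each mode relative to the V2 no-stats baseline.
overhead_pct = {
    mode: 100.0 * (mib - baseline) / baseline
    for mode, mib in footprint_mib.items()
}

for mode, pct in overhead_pct.items():
    print(f"{mode}: +{pct:.2f}%")
```

Every mode stays within 0.2% of the baseline, so the differences are small next to the roughly 14.2 GiB total.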

Read latency (avg ms / relative speedup vs V2 no-stats)

Phase	V2 no-stats	V2 +stats	V3 no-stats	V3 +stats
read 1/16 cols, `sum(c0)`	8217 (1.0x)	7427 (1.1x)	1110 (7.4x)	1050 (7.8x)
read 4/16 cols, group+agg	9325 (1.0x)	8569 (1.1x)	2648 (3.5x)	2692 (3.5x)
filter + 2/16 cols (point lookup)	8232 (1.0x)	69.5 (118x)	1210 (6.8x)	60.6 (136x)

Projected reads: V3 lazy decodes only the requested columns, so it is 7.4x faster reading
1 of 16 columns and 3.5x faster reading 4 of 16, versus eager V2 which decodes all 16.

Filtered point lookup: partition stats prune almost all batches (V2 +stats 118x), and V3
additionally lazy-decodes only the surviving batches' projected columns, giving the best result at
136x (V3 with stats). Lazy column-skip alone (V3 no-stats) is 6.8x.

All-columns read (decode everything, no skip) was measured separately at smaller scale and is
on par with or slightly faster than V2 (V3 about 1.3x at 2M rows), confirming LazyVector adds no
overhead when every column is materialized. It is omitted from the 100M table because the eager-V2
path decodes the full 100M x 16 off-heap and does not fit in memory on this 64 GiB laptop.
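
The relative figures in the read-latency table are the V2 no-stats time divided by each mode's time. A short Python sketch that recomputes them from the reported averages (the dictionary keys are shortened labels chosen for this example):

```python
# Average read latency in ms, copied from the table above.
# Column order: V2 no-stats, V2 +stats, V3 no-stats, V3 +stats.
modes = ["V2 no-stats", "V2 +stats", "V3 no-stats", "V3 +stats"]
avg_ms = {
    "read 1/16 cols": [8217, 7427, 1110, 1050],
    "read 4/16 cols": [9325, 8569, 2648, 2692],
    "filter + 2/16 cols": [8232, 69.5, 1210, 60.6],
}


def speedups(times):
    """Speedup of each mode relative to the first entry (V2 no-stats)."""
    baseline = times[0]
    return [baseline / t for t in times]


for phase, times in avg_ms.items():
    row = ", ".join(f"{m}: {s:.1f}x" for m, s in zip(modes, speedups(times)))
    print(f"{phase} -> {row}")
```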

Net: V3 lazy per-column is a large win on projected/filtered reads (the common table-cache access
pattern) with effectively identical cache footprint and no full-scan regression.

A GitHub Actions run on a larger-RAM runner can reproduce the same 100M comparison via the
Velox Backend (x86) workflow_dispatch benchmark job.

How was this patch tested?

./dev/format-scala-code.sh
PATH="/opt/homebrew/opt/llvm@15/bin:$PATH" ./dev/format-cpp-code.sh
git diff --check upstream/main..HEAD
ruby -e 'require "yaml"; YAML.load_file(".github/workflows/velox_backend_x86.yml"); puts "yaml ok"'
./.github/workflows/util/check.sh upstream/main
env CCACHE_DIR=/private/tmp/gluten-ccache ninja -C cpp/build velox/tests/CMakeFiles/velox_operators_test.dir/VeloxColumnarBatchSerializerTest.cc.o
./build/mvn install -pl backends-velox -am -Pspark-3.5 -Pscala-2.12 -Pbackends-velox -DskipTests -Dexec.skip
Local benchmark runability smoke only, not used as PR performance data: Java 8, ColumnarTableCacheLazyDeserBenchmark with 1000 rows, 4 partitions, 1 iteration, phases build,read1,read4,readAll,filter.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex GPT-5

github-actions · 2026-06-01T04:58:26Z

Run Gluten Clickhouse CI on x86

jackylee-ch · 2026-06-01T14:21:24Z

@yaooqinn PTAL

yaooqinn · 2026-06-02T05:19:34Z

Thanks @jackylee-ch, V3 layout is a sensible extension of the cache-stats wire we landed in #12092 / #12196. Several things to discuss before this lands:

1. Benchmark needs to be re-run. The checked-in -results.txt is 10K rows / 4 partitions / 1 iteration on an Apple M5 Pro — Stdev=0 across the board because there's only one sample. Differences in the 1-3 ms range (e.g. "1.1X" at all-16-cols read, where lazy mode physically cannot be faster than eager) are noise. Also build 1.9X is surprising because V3 does N serializeSingleColumn calls vs V2's single-pass batchSerialize — the ordering legacy > V2 > V3 doesn't match the physical work done; this needs reruns on a server / GHA-equivalent runner with iter≥3 and 100M rows / 32 partitions (matching the code defaults). Please also add a cache memory footprint column — V3 per-col framing + getFlattenedRowVector() flattening Dictionary/Constant encodings could regress cache size significantly for dict-encoded payloads, and that's currently unmeasured.

2. Do we really need a new SQLConf? V3 functionally supersedes V2 (V3 frames also carry statsBlob), so this isn't a new behavioral feature — it's a wire-format upgrade. Adding a dedicated lazy.deserialization.enabled boolean commits Gluten to maintaining three cache paths (legacy / V2-stats / V3-lazy-and-stats) and a three-level fallback chain. Once we trust V3, we'd want to deprecate V2-stats, which means another deprecation cycle. Could we either (a) skip the conf and gate V3 behind partitionStats.enabled once it's stable, or (b) turn partitionStats.enabled into a string conf with off | v2 | v3 values? Configuration.md already warns "V3 is NOT backward compatible with V2 readers" + default=false — operationally nobody is going to flip this, so the conf risks being long-lived dead code.

3. Cross-language test parity vs #12196. V3 has no cpp-side byte-equal golden test; JVM-side tests synthesize their own frames via craftV3Framed. We just established the cpp-golden ↔ JVM-parser round-trip pattern in #12196 specifically because layout drift between halves is a correctness hazard. V3 needs the same: a framedSerializeWithStatsV3Golden cpp test pinning a byte-stable literal + a JVM parser round-trip over that same literal.

4. Smaller items.

All-null column case not covered (we hit the PrestoSerde uninit-values bug during #12092 development, same risk class for per-col path).
getFlattenedRowVector() side effect on Dictionary/Constant encoding not documented.
The // JNI pin outlives comment in deserializeV3 describes a non-issue (copies are made synchronously in step 6, the lazy loader doesn't depend on the pin) — please trim.
Two near-identical magic checks (parseFramedBytes byte[3] dispatch vs isV3Format 4-byte compare) — please consolidate.
Consider folding statsExtV3AvailableFlag and statsExtAvailableFlag into a single capability enum (Unknown | V2 | V3 | Unavailable) — two independent one-shot latches double the operational diagnosis surface.
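
A rough sketch of the single-enum idea, in Python for brevity. `StatsExtCapability`, `CapabilityLatch`, and the `probe` callback are hypothetical names, not existing Gluten code, and thread safety is deliberately left out:

```python
from enum import Enum


class StatsExtCapability(Enum):
    UNKNOWN = "unknown"          # native side not probed yet
    V2 = "v2"                    # only the V2 stats entry point is available
    V3 = "v3"                    # the V3 per-column entry point is available
    UNAVAILABLE = "unavailable"  # neither is available, use legacy bytes


class CapabilityLatch:
    """One-shot latch: the first definite probe result sticks."""

    def __init__(self, probe):
        self._probe = probe  # callable returning a StatsExtCapability
        self._state = StatsExtCapability.UNKNOWN

    def get(self):
        if self._state is StatsExtCapability.UNKNOWN:
            self._state = self._probe()
        return self._state


calls = []


def fake_probe():
    """Stand-in for a native capability check; records how often it runs."""
    calls.append(1)
    return StatsExtCapability.V2


latch = CapabilityLatch(fake_probe)
print(latch.get(), latch.get(), len(calls))
```

One state variable means a log line or metric only needs to report a single value to explain which path a given executor took.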

Happy to file any of these as separate issues if it helps.
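
For item 3, the golden pattern could look roughly like the sketch below. It is single-language Python only to show the shape: the frame layout and the `TCV3` magic are invented for the example, and in the real pattern the same literal would be pinned by a cpp test on the writer side and a JVM test on the parser side.

```python
import struct

MAGIC = b"TCV3"  # hypothetical magic, for illustration only


def serialize(columns, stats=b""):
    """Writer half: magic, statsLen, numCols, stats, column lengths, payloads."""
    out = MAGIC + struct.pack("<II", len(stats), len(columns)) + stats
    out += b"".join(struct.pack("<I", len(c)) for c in columns)
    return out + b"".join(columns)


def parse(frame):
    """Reader half: returns (stats, [column payloads])."""
    if frame[:4] != MAGIC:
        raise ValueError("bad magic")
    stats_len, num_cols = struct.unpack_from("<II", frame, 4)
    pos = 12
    stats = frame[pos:pos + stats_len]
    pos += stats_len
    lengths = struct.unpack_from("<%dI" % num_cols, frame, pos)
    pos += 4 * num_cols
    columns = []
    for n in lengths:
        columns.append(frame[pos:pos + n])
        pos += n
    return stats, columns


# Byte-stable literal that both halves are checked against.
GOLDEN_HEX = (
    "54435633"  # magic "TCV3"
    "00000000"  # statsLen = 0
    "02000000"  # numCols = 2
    "01000000"  # len(col0) = 1
    "02000000"  # len(col1) = 2
    "01"        # col0 payload
    "0203"      # col1 payload
)
GOLDEN = bytes.fromhex(GOLDEN_HEX)
COLUMNS = [b"\x01", b"\x02\x03"]

writer_ok = serialize(COLUMNS) == GOLDEN      # writer produces the pinned bytes
reader_ok = parse(GOLDEN) == (b"", COLUMNS)   # reader recovers the same columns
print(writer_ok, reader_ok)
```

Because both checks compare against one literal, a layout change on either side fails a test instead of silently drifting.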

Write V3 per-column cache bytes by default for Velox table cache. Partition stats now only controls the optional stats/pruning payload: stats off writes a no-stats V3 frame, stats on writes V3 with stats, and older native libraries still fall back to V2 stats or legacy bytes. Add the V3 no-stats JNI/native serializer, JVM parsing for statsLen=0, cross-language golden coverage, and GitHub Actions benchmark execution without committing local benchmark results. Change-Id: I2a8582f901fafd436cac1a1d16e0367e9330b336
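
The write-path selection described in this commit message can be sketched as a small decision function. This is a Python illustration under assumptions: the function name, the two capability flags, and the exact fallback order for older native libraries are hypothetical and not taken from the patch.

```python
def choose_write_format(stats_enabled, native_supports_v3, native_supports_v2_stats):
    """Pick a cache write format from the stats conf and native capabilities.

    Assumed order: V3 whenever the native library supports it, otherwise
    V2 stats when stats are requested and supported, otherwise legacy bytes.
    """
    if native_supports_v3:
        # Stats only decide whether the optional stats payload is attached.
        return "v3+stats" if stats_enabled else "v3-no-stats"
    if stats_enabled and native_supports_v2_stats:
        return "v2+stats"
    return "legacy"


for args in [(False, True, True), (True, True, True),
             (True, False, True), (True, False, False)]:
    print(args, "->", choose_write_format(*args))
```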

github-actions Bot added CORE works for Gluten Core VELOX DOCS labels Jun 1, 2026

jackylee-ch marked this pull request as draft June 1, 2026 04:58

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 58bd451 to d5a0502 Compare June 1, 2026 08:59

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from d5a0502 to 8e374db Compare June 1, 2026 09:05

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8e374db to 0f0ccd2 Compare June 1, 2026 09:08

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 0f0ccd2 to 8b09d6b Compare June 1, 2026 11:21

jackylee-ch marked this pull request as ready for review June 1, 2026 14:20

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 8b09d6b to 09679ee Compare June 2, 2026 06:24

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 09679ee to ab9e0f7 Compare June 2, 2026 06:30

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab9e0f7 to 144e816 Compare June 2, 2026 06:47

github-actions Bot removed the CORE works for Gluten Core label Jun 2, 2026

jackylee-ch force-pushed the table-cache-lazy-deserialization branch 2 times, most recently from b77f4ab to 9a0f96a Compare June 2, 2026 07:28

github-actions Bot removed the DOCS label Jun 2, 2026

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9a0f96a to b5b1906 Compare June 2, 2026 09:01

github-actions Bot added the INFRA label Jun 2, 2026

jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from 2b96545 to c3cc1bd Compare June 2, 2026 15:28

github-actions Bot added the CORE works for Gluten Core label Jun 2, 2026

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from c3cc1bd to 97a6019 Compare June 3, 2026 03:42

github-actions Bot added the DOCS label Jun 3, 2026

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 97a6019 to 9971c91 Compare June 3, 2026 03:52

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 9971c91 to f576df8 Compare June 3, 2026 06:33

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f576df8 to f17dc6a Compare June 3, 2026 06:51

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from f17dc6a to cda20eb Compare June 3, 2026 09:27

jackylee-ch force-pushed the table-cache-lazy-deserialization branch 3 times, most recently from decdd0e to ab055c5 Compare June 3, 2026 14:16

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from ab055c5 to 2538fe5 Compare June 3, 2026 18:55

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 2538fe5 to 765794f Compare June 4, 2026 04:35

jackylee-ch force-pushed the table-cache-lazy-deserialization branch from 765794f to c7f9e2f Compare June 4, 2026 07:54

github-actions Bot removed the INFRA label Jun 4, 2026
